ABSTRACT

As world-wide air traffic continues to grow even at a mod- 
est pace, the overall complexity of the system will increase 
significantly. This increased complexity can lead to a larger 
number of fatalities per year even if the extremely low fatal- 
ity rate that we currently enjoy is maintained. One impor- 
tant source of information about the safety of the aviation 
system is in Aviation Safety Text Reports which are writ- 
ten by members of the flight crew, air traffic controllers, 
and other parties involved with the aviation system. These 
anonymized narrative reports contain fixed-field contextual 
information about the flight but also contain free-form nar- 
ratives that describe, in the author’s own words, the nature 
of the safety incident and, in many cases, the contributing 
factors that led to the safety incident. Several thousand 
such reports are filed each month, each of which is read and 
analyzed by highly trained experts. However, emerging safety
issues may go undetected because they may be reported very
infrequently and in different contexts with different
descriptions. The goal of this research
paper is to develop correlated topic models which uncover 
correlations in the subspaces defined by the intersection of 
numerous fixed fields and discovered correlated topics. This 
task requires the discovery of latent topics in the text reports 
and the creation of a topic cube. Furthermore, because the 
number of potential cells in the topic cube is very large, 
we discuss novel methods of pruning the search space in the 
topic cells, thereby making the analysis feasible. We demon- 
strate the new algorithms on an analysis of pilot fatigue and 
its contributing factors, as well as the safety incidents that 
are correlated with this phenomenon. 
1. INTRODUCTION
Many organizations have large text repositories that contain
information that is mission critical to the organization.
NASA, for example, operates a safety reporting system
known as the Aviation Safety Reporting System (ASRS) 
which collects voluntarily submitted aviation safety inci- 
dent/situation reports from pilots, controllers, and others 
with the purpose of identifying system-wide deficiencies and
safety issues [5]. ASRS can receive as many as several thousand
reports in a month and contains over 100,000 reports at
this time. The analysts at ASRS analyze each report in de- 
tail and assign reports to potentially several of 60 high-level 
anomaly categories and conduct other safety related studies 
with the reports. The reports also contain various ‘fixed-field’
pieces of information that identify the context and operating
conditions of the flight. The reports are anonymous
by law; thus, the author, his or her organization, and other 
identifying pieces of information are removed from the re- 
ports. The ASRS analysts also use automated tools to help 
them study the reports and compare them with others, but 
the vast majority of the work is done by skilled experts. 

In the early 1990s, ASRS personnel issued an alert to the
Federal Aviation Administration (FAA) based on studies of text
reports submitted over the previous several years. These
studies concluded that the Boeing 757 generated wake
turbulence, a form of turbulence that is created behind an
aircraft as it passes through the air.
This dangerous phenomenon can lead to catastrophic conse- 
quences for smaller aircraft that are following the lead air- 
craft and is an important factor in determining the capacity 
of an airport [3]. 

When the B757 was initially put into service, it had a wake-
vortex classification that allowed a smaller separation
between it and a trailing aircraft. The FAA noted the alert
issued in the late 1980s and 1990s but, as it turns out, acted
for unrelated reasons: it reclassified the 757 as an aircraft
that requires a larger separation after two fatal accidents.
Thus, although the regulatory agency did not take action on
this particular alert, it is an excellent example of the
identification of precursors to catastrophic accidents based
on the analysis of text reports.

In principle, to detect this problem and find corroborating
evidence for it, the ASRS analysts had to comb through a huge
amount of information in their system. The research discussed
in this paper addresses the problem of providing automated
methods to identify potential precursors to safety incidents
in large volumes of safety-related reports using a combination
of a correlated topic model and a powerful and scalable
multidimensional cube.

Several methods have been developed that can help enable
the automatic classification [19] of text reports into anomaly
categories [18] [15], and significant work has been performed
in the area of correlated topic models [1].

The problem that we focus on in this paper is the generation 
of a method to automatically uncover documents that are 
correlated with a topic of interest and then analyze the re- 
sulting set of reports using a scalable multidimensional cube. 
The multidimensional cube could consist of the fixed-fields 
already identified in a set of reports, but more interestingly, 
other topics that have been discovered in the text reposi- 
tory. These reports can offer a significant amount of insight 
into the main topic and its contributing factors. We use this 
as a running example throughout the paper to illustrate the 
performance and output of the system. 

For example, consider a study of pilot fatigue, which is
thought to be a contributing factor to aviation safety
incidents. The authors may not directly mention the word
‘fatigue’ in their write-ups. Instead, they may use phrases
such as “I was on the last leg of a 5 segment trip” or “THIS
WAS THE FINAL LEG OF A MULTI-LEG FLT AND I WAS MORE TIRED THAN
I THOUGHT” [20]. In neither example does the author directly
state the word ‘fatigue’. In the last example, the author
indicated that he/she was tired due to being on the final leg
of a trip. Notice that the author of this excerpt uses
abbreviations; ASRS documents are laden with abbreviations in
the narrative sections.

2. PROBLEM FORMULATION 

2.1 Preliminaries: The Text Cube Model 

A set of documents $D$ is stored in an $n$-dimensional database
$DB = (A_1, A_2, \ldots, A_n, D)$. Each row $r \in DB$ corresponds
to a document $d \in D$ in the form $r = (a_1, a_2, \ldots, a_n, d)$,
where $a_i \in A_i$ means the value of the dimension $A_i$ for $r$ is
$a_i$. We denote $r(D) = d$ and $r(A_i) = a_i$.

The data cube model [8] extended to the above multidimen- 
sional text database is called the text cube [13]. Several 
important concepts are introduced as follows. 

Definition 1. Text Cube: Cell and Measure. In the text cube
built on a set of documents $D$, a cell is of the form
$c = (a_1, a_2, \ldots, a_n : D', f_1(D'), f_2(D'), \ldots, f_m(D'))$,
where either $a_i \in A_i$ (i.e., the value of dimension $A_i$ for $c$ is
$a_i$) or $a_i = *$ (i.e., the dimension $A_i$ is aggregated in $c$). $D'$
is the aggregated document set for $c$, formally defined as
$D' = \{r \mid r \in DB,\ r(A_i) = a_i \text{ if } a_i \neq *\}$.
$f_1(D'), f_2(D'), \ldots, f_m(D')$ are measures on $D'$ that are
computed by aggregate functions. We denote $c(D) = D'$ and
$c(A_i) = a_i$.

Cells with $m$ non-* dimensions are called $m$-dim cells. An
$n$-dim cell is said to be a base cell, with no aggregated
dimension, and a 0-dim cell is the apex cell that aggregates
all dimensions.

Definition 2. Ancestor and Descendant. Cell $c'$ is an
ancestor of $c$ (or $c$ is a descendant of $c'$) iff
$\forall i : c'(A_i) \neq * \Rightarrow c(A_i) = c'(A_i)$.
Note that a cell is an ancestor (or descendant) of itself. A
base cell has no descendant except itself, and the apex cell
has no ancestor except itself.

Definition 3. Parents and Children are immediate ancestors
and descendants of a cell, respectively. Cell $c'$ is a
parent of $c$ (or $c$ is a child of $c'$) iff (i) $c'$ is an ancestor of
$c$, and (ii) $c'$ is an $i$-dim cell while $c$ is an $(i+1)$-dim cell.
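Definitions 1 to 3 can be illustrated with a minimal Python sketch. It assumes a cell is represented as a tuple of dimension values with `"*"` for an aggregated dimension, and it uses the seven toy rows of Table 2 (Example 1) as the database; the function names are hypothetical and are not taken from the authors' system.

```python
STAR = "*"  # marks an aggregated dimension

# Toy database rows: (State, Time, Weather, document id), as in Table 2.
ROWS = [
    ("IL", "night", "rain", "d1"),
    ("IL", "night", "snow", "d2"),
    ("CA", "night", "snow", "d3"),
    ("CA", "daytime", "snow", "d4"),
    ("NY", "night", "snow", "d5"),
    ("NY", "daytime", "rain", "d6"),
    ("NY", "daytime", "fog", "d7"),
]


def matches(cell, values):
    """True if every non-* dimension of `cell` equals the value in `values`."""
    return all(a == STAR or a == v for a, v in zip(cell, values))


def aggregated_docs(cell):
    """Aggregated document set c(D) of a cell (Definition 1)."""
    return [row[-1] for row in ROWS if matches(cell, row)]


def dims(cell):
    """Number of non-* dimensions, i.e. the m of an m-dim cell."""
    return sum(a != STAR for a in cell)


def is_ancestor(anc, cell):
    """Definition 2: wherever `anc` is instantiated, `cell` has the same value."""
    return matches(anc, cell)


def is_parent(par, cell):
    """Definition 3: an ancestor with exactly one fewer non-* dimension."""
    return is_ancestor(par, cell) and dims(par) + 1 == dims(cell)
```

For instance, `aggregated_docs(("*", "night", "*"))` collects the four night-time documents, and `("*", "night", "*")` is a parent of `("IL", "night", "*")`.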

Measures in a text cube are categorized as distributive,
algebraic, or holistic [10], based on the kind of aggregate
function used.

Distributive: An aggregate function is distributive if the
aggregate value of a cell $c$ can be computed using only the
aggregate values of $c$’s children, e.g., count() and sum().

Algebraic: An aggregate function is algebraic if it can be 
computed by an algebraic function on a limited number of 
distributive measures, e.g., avg() and deviation(). 

Holistic: An aggregate function is holistic if there is no
constant bound on the storage size needed to describe a
subaggregate, e.g., median() and mode().

To make the time and space complexity affordable, in most 
cases, we require a measure for a text cube to be either 
distributive or algebraic. 
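The difference between the three categories can be seen in a small sketch. The numeric values attached to the three child cells below are invented purely for illustration.

```python
from statistics import median

# Hypothetical numeric values attached to the documents of three child cells.
children = {"IL": [3, 5], "CA": [4], "NY": [1, 9, 2]}

# Distributive: the parent's sum and count follow from the children's alone.
child_sum = {key: sum(vals) for key, vals in children.items()}
child_count = {key: len(vals) for key, vals in children.items()}
parent_sum = sum(child_sum.values())
parent_count = sum(child_count.values())

# Algebraic: avg() is a function of two distributive measures.
parent_avg = parent_sum / parent_count

# Holistic: the median of the children's medians is not the parent's median,
# so the parent needs access to all underlying values.
all_values = [x for vals in children.values() for x in vals]
true_median = median(all_values)
median_of_medians = median(median(vals) for vals in children.values())
```

Here the parent's sum, count and average are recovered from per-child summaries of constant size, whereas the median cannot be.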

Definition 4. Topic. A semantically coherent topic in
a text collection is represented by a topic model $\theta$, which is
a probability distribution over words $\{p(w \mid \theta)\}_{w \in W}$,
where $W$ is the vocabulary. Clearly, we have
$\sum_{w \in W} p(w \mid \theta) = 1$.

2.2 Topic Correlation Analysis in Text Cube 

2.2.1 Problem 

We define the task of topic correlation analysis in text cube 
as follows. Given a text cube and k topics, we aim to answer 
the following two questions: 

Topic Relevance Analysis. Given a keyword query
$Q = \{q_1, q_2, \ldots, q_{|Q|}\}$, what is the most relevant topic
$\theta$ to $Q$?

Topic Correlation Analysis. Given a cell $c$, are there
other topics $\alpha \neq \theta$ such that $\theta$ and $\alpha$ are correlated in the
scenario of $c$? The scenario of cell $c$ is a condition,
represented by a selected set of dimensions, with some possibly
instantiated, such as Weather = “Fog”, on which the topic
correlations will be analyzed.

Example 1. $k = 3$ topics are generated, and the word
distributions of the topics are shown in Table 1. Notice that
the first topic contains many words that are related to being
tired or fatigued. The second topic contains words that
describe potential issues in an engine, and the third topic
contains words that are related to attention and awareness.
In Table 2, a text cube is built on the document set
$D = \{d_1, d_2, \ldots, d_7\}$ with three dimensions ‘State’, ‘Time of the
Day’ and ‘Weather’ as well as the topic distributions of
documents. Given the keyword query $Q$ = {‘I’, ‘am’, ‘tired’},
Topic 1 is regarded as the most relevant topic to $Q$. Given
the cell $c_1$ = (*, ‘night’, *), Topic 3 is correlated to Topic 1
in the scenario of $c_1$. Also, given the cell $c_2$ = (*, *, ‘snow’),
Topic 2 is correlated to Topic 1 in $c_2$.

Topic 1             Topic 2                 Topic 3
day       0.057     engine        0.230     factor        0.040
hour      0.043     oil           0.004     awareness     0.015
trip      0.027     shutdown      0.013     lack          0.011
time      0.027     pressure      0.012     fail          0.011
rest      0.019     start         0.011     performance   0.009
night     0.019     power         0.007     corrective    0.009
leg       0.017     temperature   0.007     attention     0.007
fatigue   0.012     landing       0.005     error         0.007
morning   0.009     compressor    0.005     action        0.007
long      0.009     vibration     0.004     realize       0.007
early     0.007     restart       0.003     poor          0.006
tired     0.007     fail          0.003     failure       0.006
sleep     0.007     filter        0.003     miss          0.005

Table 1: Word Distributions.

       Dimensions                    Topic Distributions
Doc    State   Time      Weather     Topic 1   Topic 2   Topic 3
d_1    IL      night     rain        0.3       0.1       0.6
d_2    IL      night     snow        0.6       0.2       0.2
d_3    CA      night     snow        0.2       0.4       0.4
d_4    CA      daytime   snow        0.4       0.3       0.3
d_5    NY      night     snow        0.1       0.8       0.1
d_6    NY      daytime   rain        0.5       0.4       0.1
d_7    NY      daytime   fog         0.3       0.3       0.4

Table 2: Text Cube and Topic Distributions.

2.2.2 Motivation

Topic relevance and correlation analysis in text cube is useful
to aviation safety analysis for several reasons, including:

• Many aviation safety databases such as ASRS¹ consist
of both textual (e.g., the pilot report about the accident)
and multi-dimensional (e.g., ‘location’, ‘time’ and
‘weather’ associated with the pilot report) information,
which can naturally fit in a text cube [13, 23].

• A topic in the aviation safety databases corresponds
to an issue that may explain what happened during the
flight. For example, in Table 1, Topic 1 describes a
‘fatigue’ problem, which could be caused by ‘long duration
trip’ or ‘early awakening from sleep’. Topic 2 contains
‘engine’ issues, and Topic 3 describes the ‘attention and
awareness’ related issues.

• Users of aviation safety databases may not have complete
knowledge about a flight issue (i.e., a topic), but
instead use a set of keywords to express their target
topic. Based on this type of input, Topic Relevance
Analysis could supply a way to match user queries to
underlying topics.

• To analyze correlated topics is to analyze correlated
flight issues. The latter can facilitate or be a fundamental
component for many aviation safety applications, such as
classification [15], causal analysis [3] and error source
finding [11].

1 http://asrs.arc.nasa.gov/

Organization. We organize the rest of this paper as fol- 
lows. Section 3 generates topics in the preprocessing step. 
Section 4 and Section 5 propose solutions to the Topic Rele- 
vance Analysis and the Topic Correlation Analysis problems 
respectively. Section 6 performs experimental studies on a 
real aviation safety database, and finally, Section 7 concludes 
the whole paper. 

3. PREPROCESSING 

Documents in aviation safety databases contain many
abbreviations, acronyms and phrases. Abbreviations and acronyms
can be expanded to their full forms by using domain
dictionaries, but phrases are much more difficult to handle
because of the large number of potential combinations.

To overcome this problem, sequential pattern mining tech- 
niques are developed in [6], by which each detected pattern 
(i.e., a set of keywords that appear together frequently) is re- 
garded as a phrase and appearances of phrases in documents 
are replaced by special terms that stand for corresponding 
phrases. 

After phrases are replaced by terms, $k$ topics are generated
by running LDA (Latent Dirichlet Allocation) [2]. For the
rest of this paper, we assume that both the word distributions
over topics (i.e., $\Pr(w \mid \theta)$ for a word $w$ and a topic $\theta$) and the
topic distributions over documents (i.e., $\Pr(\theta \mid d)$ for a topic
$\theta$ and a document $d \in D$) are prior knowledge.

4. TOPIC RELEVANCE ANALYSIS

In this section, we formally define the problem of topic
relevance analysis as: given a keyword query
$Q = \{q_1, q_2, \ldots, q_{|Q|}\}$, which topic $\theta$ maximizes the
relevance score $Rel(Q, \theta)$ based on $Q$:

$$Rel(Q, \theta) = \Pr(\theta \mid Q) = \frac{\Pr(Q \mid \theta)\,\Pr(\theta)}{\Pr(Q)} \qquad (1)$$

4.1 Relevance Function

By the law of total probability, we expand the second
component in the numerator of Equation 1 as

$$\Pr(\theta) = \sum_{d \in D} \Pr(\theta \mid d)\,\Pr(d),$$

where $\Pr(d)$ is assumed to follow a uniform distribution,
i.e., $\Pr(d) = \frac{1}{|D|}$.

Following unigram topic modeling algorithms [17, 12], we
assume independence among words, so that the first component
in the numerator and the denominator of Equation 1 become:

$$\frac{\Pr(Q \mid \theta)}{\Pr(Q)} = \prod_{q_i \in Q} \frac{\Pr(q_i \mid \theta)}{\Pr(q_i)},$$

where $\Pr(q_i)$ equals the number of occurrences of $q_i$ divided
by the total number of occurrences of all words in $D$, i.e.,

$$\Pr(q_i) = \frac{count(q_i, D)}{\sum_{w \in W} count(w, D)}.$$

Recall that $W$ is the vocabulary for any topic.

Finally, Equation 1 turns out to be:

$$Rel(Q, \theta) = \left(\frac{1}{|D|} \sum_{d \in D} \Pr(\theta \mid d)\right) \prod_{q_i \in Q} \frac{\Pr(q_i \mid \theta)}{\Pr(q_i)} \qquad (2)$$

Example 2. Following Example 1, a keyword query is
given as $Q$ = {‘tired’, ‘long’, ‘trip’}, and we calculate
the relevance scores for each of the three topics. Since only
words with the highest probabilities are listed, we simply
assume the probability of an unlisted word in a topic equals
0.001. In addition, for the prior of keywords, we have
Pr(‘tired’) = 0.01256, Pr(‘long’) = 0.06675, and Pr(‘trip’) =
0.05573. According to Equation 2, we get these relevance
scores as $Rel(Q, \text{Topic 1})$ = 0.01248, $Rel(Q, \text{Topic 2})$ =
7.643e-6, and $Rel(Q, \text{Topic 3})$ = 6.688e-6, among which Topic 1
is the most relevant topic to the query $Q$. Note that the result
is approximate because of the assumption that an unlisted
word has a generative probability of 0.001.

4.2 Complexity Analysis

4.2.1 Complexity without Pre-computation

After the keyword query $Q$ arrives, we scan the $k$ topics
one by one. For each topic $\theta$, we use Equation 2 to calculate
$Rel(Q, \theta)$ and output the topic that maximizes the
relevance score. The computational costs for the first and
the second parts of Equation 2 are $O(|D|)$ and $O(|Q|)$,
respectively. Hence, the overall computational complexity for
exhausting all topics is

$$O(k(|D| + |Q|)).$$

An aviation safety database usually stores a vast number of
records, e.g., our ASRS dataset has 61,235 flight records
for 10 years; hence the above time complexity is too large
for online queries.

4.2.2 Complexity with Pre-computation

It is easy to see that $\Pr(\theta \mid d)$ is independent of queries, so
we can pre-compute and store $\Pr(\theta)$ for each topic $\theta$, which
reduces the overall computational cost to

$$O(k(1 + |Q|)) = O(k|Q|).$$

Note that in this case we need additional $O(k)$ space to store
the pre-computation results. Usually, both $k$ and $|Q|$ are
small. For example, in our experimental study on the ASRS
dataset, $k$ is 100, and $|Q|$ is no more than 10 words. Hence,
the per-query cost is small enough to respond to online
queries.

5. TOPIC CORRELATION ANALYSIS

In this section, the formula of topic correlation score is
discussed in Section 5.1, and a dilemma regarding the
computational issue is stated in Section 5.2. To overcome the
dilemma, we propose the idea of partially materializing the
text cube. Hence, Section 5.3 explains how to process queries
in a partially materialized text cube, and Section 5.4
introduces how to select cells for pre-computation.

5.1 Correlation Score

In traditional topic correlation analysis [1, 16, 21], two topics
are correlated if they have the same or similar context. The
so-called ‘context’ is usually explained as a corpus. Formally,
a typical way to define the correlation of two topics $\alpha$ and
$\beta$ over a corpus $D = \{d_1, d_2, \ldots, d_{|D|}\}$ is to calculate the
cosine of the angle between two vectors:

$$Col(\alpha, \beta) = Cosine(\vec{V}(\alpha), \vec{V}(\beta)) = \frac{\vec{V}(\alpha) \cdot \vec{V}(\beta)}{\|\vec{V}(\alpha)\|\,\|\vec{V}(\beta)\|},$$

where $\vec{V}(\alpha)$ is the topic distribution vector of $\alpha$, i.e.,

$$\vec{V}(\alpha) = (\Pr(\alpha \mid d_1), \Pr(\alpha \mid d_2), \ldots, \Pr(\alpha \mid d_{|D|})). \qquad (3)$$

To consider the above topic correlation problem in the
scenario of the text cube, we regard the ‘context’ to be cells.
Concretely, for a cell $c$, we re-define Equation 3 as

$$\vec{V}_c(\alpha) = (\Pr(\alpha \mid d'_1), \Pr(\alpha \mid d'_2), \ldots, \Pr(\alpha \mid d'_{|c(D)|})),$$

where $c(D) = \{d'_1, d'_2, \ldots, d'_{|c(D)|}\}$ is the aggregated
document set for the cell $c$.

To sum up, the correlation score of two topics $\alpha$ and $\beta$ in
the cell $c$ equals

$$Col_c(\alpha, \beta) = \frac{\vec{V}_c(\alpha) \cdot \vec{V}_c(\beta)}{\|\vec{V}_c(\alpha)\|\,\|\vec{V}_c(\beta)\|} \qquad (4)$$

Example 3. Following Example 2, consider two cells $c_1$
= (*, ‘night’, *) and $c_2$ = (*, *, ‘snow’). For $c_1$, the topic
distribution vectors of the three topics are (0.3, 0.6, 0.2, 0.1),
(0.1, 0.2, 0.4, 0.8) and (0.6, 0.2, 0.4, 0.1), respectively. The
correlation score between Topic 1 and Topic 2 is 0.4755, and
the one between Topic 1 and Topic 3 is 0.7305. For $c_2$, the
topic distribution vectors of the three topics are (0.6, 0.2,
0.4, 0.1), (0.2, 0.4, 0.3, 0.8) and (0.2, 0.4, 0.3, 0.1),
respectively. The correlation score between Topic 1 and Topic 2
is 0.5494, and the score between Topic 1 and Topic 3 is 0.7980.
It is observed that Topic 1 and Topic 3 are correlated in the
scenarios of ‘night’ and ‘snow’.
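The relevance score of Equation 2 and the cell-level correlation of Equation 4 can be checked on the toy data of Tables 1 and 2 with a short Python sketch. This is only an illustration: the data layout and function names are assumptions made here, the topic priors are computed once up front in the spirit of Section 4.2.2, and unlisted words receive the probability 0.001 assumed in Example 2.

```python
from math import prod, sqrt

# Table 2: (State, Time, Weather, [Pr(topic 1|d), Pr(topic 2|d), Pr(topic 3|d)])
DOCS = [
    ("IL", "night", "rain", [0.3, 0.1, 0.6]),
    ("IL", "night", "snow", [0.6, 0.2, 0.2]),
    ("CA", "night", "snow", [0.2, 0.4, 0.4]),
    ("CA", "daytime", "snow", [0.4, 0.3, 0.3]),
    ("NY", "night", "snow", [0.1, 0.8, 0.1]),
    ("NY", "daytime", "rain", [0.5, 0.4, 0.1]),
    ("NY", "daytime", "fog", [0.3, 0.3, 0.4]),
]

# Table 1 entries for the query words of Example 2; only Topic 1 lists them.
WORD_GIVEN_TOPIC = [{"tired": 0.007, "long": 0.009, "trip": 0.027}, {}, {}]
UNLISTED = 0.001  # probability assumed for a word not listed in a topic
WORD_PRIOR = {"tired": 0.01256, "long": 0.06675, "trip": 0.05573}

# Query-independent topic priors Pr(theta), computed once (uniform Pr(d)).
TOPIC_PRIOR = [sum(doc[3][t] for doc in DOCS) / len(DOCS) for t in range(3)]


def relevance(query, topic):
    """Equation 2: Pr(theta) times the product of Pr(q|theta) / Pr(q)."""
    ratios = (WORD_GIVEN_TOPIC[topic].get(q, UNLISTED) / WORD_PRIOR[q] for q in query)
    return TOPIC_PRIOR[topic] * prod(ratios)


def cell_docs(cell):
    """Documents whose dimension values match the non-* entries of the cell."""
    return [d for d in DOCS if all(a == "*" or a == v for a, v in zip(cell, d[:3]))]


def correlation(cell, a, b):
    """Equation 4: cosine of the two topic vectors restricted to the cell."""
    docs = cell_docs(cell)
    va = [d[3][a] for d in docs]
    vb = [d[3][b] for d in docs]
    dot = sum(x * y for x, y in zip(va, vb))
    return dot / (sqrt(sum(x * x for x in va)) * sqrt(sum(y * y for y in vb)))


query = ["tired", "long", "trip"]
best_topic = max(range(3), key=lambda t: relevance(query, t))
```

With these inputs, `best_topic` is index 0 (Topic 1), and `correlation(("*", "night", "*"), 0, 2)` evaluates the Topic 1 and Topic 3 score for the ‘night’ cell discussed in Example 3. Topics are indexed from 0 in the code and from 1 in the text.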



5.2 Full Cube Computation 

Since the number of topics (i.e., $k$) is small, the problem of
finding correlated topics can be split into first calculating
$Col_c(\alpha, \theta)$ for each topic $\alpha$ ($\alpha \neq \theta$) and then outputting
the $\alpha$ that maximizes the correlation score. When the
queried cell $c$ arrives, we can simply scan all documents in $c$
and compute Equation 4 as:

$$Col_c(\alpha, \theta) = \frac{\sum_{d \in c(D)} \Pr(\alpha \mid d)\,\Pr(\theta \mid d)}{\sqrt{\sum_{d \in c(D)} \Pr^2(\alpha \mid d)}\ \sqrt{\sum_{d \in c(D)} \Pr^2(\theta \mid d)}} \qquad (5)$$

However, the computational cost is $O(|c(D)|)$, i.e., the number
of documents in cell $c$, which is too large to guarantee timely
responses to online queries. To reduce the online
computational cost, we can compute and store some values
offline, so that answering online queries can be accelerated by
utilizing the stored values. This step is called
materialization [8] of the data cube.

Concretely, we decompose Equation 5 into two parts:

1. $SS_\theta(c) = \sum_{d \in c(D)} \Pr^2(\theta \mid d)$ for each topic, and

2. $DM_{\theta_1, \theta_2}(c) = \sum_{d \in c(D)} \Pr(\theta_1 \mid d)\,\Pr(\theta_2 \mid d)$ for each pair of
topics.

For convenience of expression, we abbreviate $\{SS_\theta(c)\}_\theta$
and $\{DM_{\theta_1, \theta_2}(c)\}_{\theta_1, \theta_2}$ as $SS(c)$ and $DM(c)$, respectively.
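Because both quantities are plain sums over the documents of a cell, they can be computed once per cell and added across disjoint cells. The sketch below illustrates this on the four ‘snow’ documents of Table 2, grouped by ‘State’; the helper names are hypothetical, and only pairs with the first topic index smaller than the second are stored, which is an implementation choice made here.

```python
from math import sqrt

K = 3  # number of topics in the toy example

# Pr(topic | doc) for the 'snow' documents of Table 2, grouped by State.
SNOW_DOCS_BY_STATE = {
    "IL": [[0.6, 0.2, 0.2]],                   # d2
    "CA": [[0.2, 0.4, 0.4], [0.4, 0.3, 0.3]],  # d3, d4
    "NY": [[0.1, 0.8, 0.1]],                   # d5
}


def measures(docs):
    """SS (sum of squares per topic) and DM (sum of products per topic pair)."""
    ss = [sum(d[t] ** 2 for d in docs) for t in range(K)]
    dm = {(a, b): sum(d[a] * d[b] for d in docs)
          for a in range(K) for b in range(a + 1, K)}
    return ss, dm


def add_measures(parts):
    """Aggregate a parent cell's SS and DM by summing those of its children."""
    ss = [sum(p[0][t] for p in parts) for t in range(K)]
    dm = {pair: sum(p[1][pair] for p in parts) for pair in parts[0][1]}
    return ss, dm


def correlation(ss, dm, a, b):
    """Equation 5 evaluated from stored measures only."""
    a, b = min(a, b), max(a, b)
    return dm[(a, b)] / sqrt(ss[a] * ss[b])


children = [measures(docs) for docs in SNOW_DOCS_BY_STATE.values()]
parent = add_measures(children)  # measures of the cell (*, *, 'snow')
direct = measures([d for docs in SNOW_DOCS_BY_STATE.values() for d in docs])
```

In this representation a cell stores $k$ values for $SS$ and $k(k-1)/2$ values for $DM$, which is the source of the quadratic storage cost per cell.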

The simplest algorithm to compute measures in a full
$n$-dimensional text cube is: first compute all cells in the $n$-D
cuboid (i.e., base cells); then compute all cells in the
$(n-1)$-D cuboid; and so on; finally compute the 0-D cuboid (i.e.,
the apex cell). After the full text cube is computed, any topic
correlation score can be obtained by directly retrieving $SS()$
and $DM()$ and doing some simple computation. However,
the key points of such materialization are: (i) how large the
storage cost is; and (ii) how an $(r-1)$-D cell is aggregated from
some $r$-D cells without looking at the original database.
We discuss the two issues in turn:


Storage cost. In a traditional data cube, usually only
$O(1)$ space is required by the measure of each cell; however,
$SS(c)$ and $DM(c)$ need as much as $O(k^2)$ storage. For aviation
safety databases, problems (i.e., topics) that happen during
the flight are complex, diverse and variable. For example, in
our experiments, there are as many as 100 topics in the ASRS
dataset, whose range covers ‘environmental facts’, ‘human
factors’, ‘engine problem’, etc. Such circumstances sharply
enlarge the storage size compared to traditional data cubes.
Although storage is becoming cheaper, such space complexity is
still challenging.

Aggregation. Both $SS$ and $DM$ are distributive measures [8].
Let $c = (a_1, a_2, \ldots, a_n : c(D), SS(c), DM(c))$ be an
$(r-1)$-D cell. W.l.o.g., suppose $a_n = *$ and $A_n$ has $m$ distinct
values $a_{n,1}, a_{n,2}, \ldots, a_{n,m}$, so that $c$’s children are
$c_j = (a_1, a_2, \ldots, a_{n-1}, a_{n,j} : c_j(D), SS(c_j), DM(c_j))$ for
$j = 1, 2, \ldots, m$. It is easy to prove that $SS(c)$ and $DM(c)$ can
be efficiently aggregated from $SS(c_j)$ and $DM(c_j)$ as

$$SS_\theta(c) = \sum_{c_j} SS_\theta(c_j)$$

$$DM_{\theta_1, \theta_2}(c) = \sum_{c_j} DM_{\theta_1, \theta_2}(c_j)$$

5.3 Query Processing in Partially Materialized Cube

Although $SS$ and $DM$ can be efficiently aggregated, unlike
distributive/algebraic measures in a traditional data cube,
they consume a huge amount of space if materialized for all
cells, which is not affordable for an aviation safety database.
Therefore, in this subsection, we introduce how to process
the topic correlation queries in a text cube where only a
subset of cells are precomputed; and in Section 5.4, we discuss
how to optimize the storage size by appropriately choosing
the subset.

A text cube is said to be partially materialized if a subset of
cells are precomputed while the rest are not. In such a text
cube, a query should be processed as:

1. If the corresponding cell is precomputed, the value
stored can be directly returned;

2. Otherwise, we can obtain this cell by aggregating a set
of precomputed cells.

For a non-materialized cell, there are different ways of
choosing the set of precomputed cells from which to obtain the
queried cell. So we have the chance to choose the ‘optimal’ set,
which incurs the minimum cost in the aggregation process. To
formally define the query processing problem, we first need to
introduce the concepts of decision space and cost model.

Decision Space. Consider a non-materialized queried cell
$c = (a_1, a_2, \ldots, a_n : c(D), SS(c), DM(c))$. W.l.o.g.,
suppose $a_i = *$ for $i = 1, 2, \ldots, n'$, and $a_i \in A_i$ for
$i = n'+1, n'+2, \ldots, n$. We furthermore denote
$c_{i,j} = (a_1, a_2, \ldots, a_{i-1}, a_{i,j}, a_{i+1}, \ldots, a_n : c_{i,j}(D), SS(c_{i,j}), DM(c_{i,j}))$
for $i = 1, 2, \ldots, n'$ and $a_{i,j} \in A_i$.
We have $n'$ choices to aggregate $c$, i.e., the so-called
$A_i$-based aggregation is to aggregate $c$ from the set of
cells $\{c_{i,1}, c_{i,2}, \ldots, c_{i,|A_i|}\}$. W.l.o.g., suppose we choose
to aggregate $c$ based on $A_1$; then for each $c_{1,j}$, if it is
pre-computed, it can be directly retrieved; otherwise,
recursively, to obtain $SS$ and $DM$ for this cell, we
have $n' - 1$ choices, aggregating other cells on one of the
dimensions $A_2, A_3, \ldots, A_{n'}$.

Cost Model. Given a queried cell $c$, the cost of processing
the query is the number of precomputed cells we need to
access. In particular, if $c$ corresponds to an empty cell, the
cost is defined to be 0. If $c$ corresponds to a precomputed
cell, the cost is 1.



Now the query processing problem turns out to be: what
is the best way to aggregate a queried cell $c$ so that the
cost is minimum? We use the dynamic programming algorithm
Aggregate(c) to recursively compute the optimal
cost/decision. $Cost(c)$ denotes the optimal cost of a queried
cell $c$, and $Best(c)$ denotes the corresponding set of cells
that need to be accessed under the optimal cost. Of course,
$|Best(c)| = Cost(c)$. We compute $Cost(c)$ and $Best(c)$ case
by case:

1. $Cost(c) = 0$, if $c$ corresponds to an empty cell. In this
case, $Best(c) = \emptyset$, and we respond to the query by
returning $SS_\theta(c) = 0$ for any topic $\theta$ and
$DM_{\theta_1, \theta_2}(c) = 0$ for any pair of topics $\theta_1$ and $\theta_2$.

2. $Cost(c) = 1$, if $c$ corresponds to a pre-computed cell.
Here, $Best(c) = \{c\}$, and we answer the query by
directly retrieving the stored values.

3. If $c$ corresponds to a non-empty, non-materialized cell, let

$$i' = \arg\min_{i :\, c(A_i) = *} \left( \sum_{a_{i,j} \in A_i} Cost(c_{i,j}) \right).$$

In this situation,

$$Cost(c) = \sum_{a_{i',j} \in A_{i'}} Cost(c_{i',j})$$

$$Best(c) = \bigcup_{a_{i',j} \in A_{i'}} Best(c_{i',j})$$

If $c_{i',j}$ is not materialized, we repeat the same procedure.

Algorithm 1 shows the pseudo code of Aggregate(c).


Algorithm 1 Aggregate(c)

ALGORITHM:

if c is empty then
    Cost(c) ← 0; Best(c) ← ∅;
    return;
end if
if c is materialized then
    Cost(c) ← 1; Best(c) ← {c};
    return;
end if
Cost(c) ← +∞;
for each i s.t. c(A_i) = * do
    CurCost ← 0;
    for each a_{i,j} ∈ A_i do
        Aggregate(c_{i,j});
        CurCost ← CurCost + Cost(c_{i,j});
    end for
    if CurCost < Cost(c) then
        Cost(c) ← CurCost;
        Best(c) ← ∪_{a_{i,j} ∈ A_i} Best(c_{i,j});
    end if
end for
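A minimal Python sketch of Algorithm 1 follows, using the three dimensions and seven base cells of Table 2. It assumes that cells are tuples with `"*"` for aggregated dimensions and that the set of materialized cells is passed in as a frozenset; memoization through `lru_cache` stands in for the dynamic programming table. None of these names come from the authors' implementation.

```python
from functools import lru_cache

STAR = "*"
# Dimension domains (State, Time, Weather) and non-empty base cells of Table 2.
DOMAINS = [("IL", "CA", "NY"), ("night", "daytime"), ("rain", "snow", "fog")]
BASE_CELLS = frozenset({
    ("IL", "night", "rain"), ("IL", "night", "snow"), ("CA", "night", "snow"),
    ("CA", "daytime", "snow"), ("NY", "night", "snow"),
    ("NY", "daytime", "rain"), ("NY", "daytime", "fog"),
})


def is_empty(cell):
    """A cell is empty if no base cell matches its instantiated dimensions."""
    return not any(all(a == STAR or a == b for a, b in zip(cell, base))
                   for base in BASE_CELLS)


@lru_cache(maxsize=None)
def aggregate(cell, materialized):
    """Return (Cost(cell), Best(cell)) for a given set of materialized cells."""
    if is_empty(cell):
        return 0, frozenset()
    if cell in materialized:
        return 1, frozenset([cell])
    best_cost, best_cells = float("inf"), frozenset()
    for i, value in enumerate(cell):
        if value != STAR:
            continue
        # A_i-based aggregation: sum the costs of the children along dimension i.
        cur_cost, cur_cells = 0, frozenset()
        for a in DOMAINS[i]:
            child = cell[:i] + (a,) + cell[i + 1:]
            child_cost, child_cells = aggregate(child, materialized)
            cur_cost += child_cost
            cur_cells |= child_cells
        if cur_cost < best_cost:
            best_cost, best_cells = cur_cost, cur_cells
    return best_cost, best_cells


only_base = BASE_CELLS
with_night_snow = BASE_CELLS | {(STAR, "night", "snow")}
cost_a, best_a = aggregate((STAR, "night", STAR), only_base)
cost_b, best_b = aggregate((STAR, "night", STAR), with_night_snow)
```

When only base cells are stored, answering the ‘night’ cell touches its four base cells; once `(*, night, snow)` is also materialized, aggregating along ‘Weather’ needs only two stored cells.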


5.4 Optimizing Space Cost with Bounded Query Processing Cost

The remaining question is how to choose a subset of cells to
precompute, s.t.

(i) Any query can be answered successfully.

(ii) For any cell $c$, $Cost(c)$ is bounded by a user-specified
threshold $\epsilon$.

(iii) The storage cost (i.e., the total number of precomputed
cells) is as small as possible.

Since base cells cannot be aggregated from other cells, they
must be precomputed. For non-base cells, we define a
topological order on these cells according to their granularity
levels, i.e., in the order, an $r$-D cell is put before an $(r-1)$-D
cell. The intuition of how to select cells for precomputation
is: we scan cells in the topological order one by one, and we
precompute cells as late as possible in the topological order.
That is to say, we scan non-base cells one by one. For a cell
$c$, if $Cost(c)$ does not exceed the threshold $\epsilon$, we delay its
computation to the online query processing, because the query
time is still well bounded; otherwise, we materialize $c$. This
method is called T-CUBING and is described in Algorithm 2.


Algorithm 2 T-CUBING(c, ε)
ALGORITHM:

if c is a base cell then
    precompute c; Cost(c) ← 1;
else
    Aggregate(c);
    if Cost(c) > ε then
        precompute c; Cost(c) ← 1;
    end if
end if


The time complexity of T-CUBING for each cell is
$O(n \max_i |A_i|)$.
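The selection procedure can be sketched in Python as a bottom-up scan. The sketch assumes the toy cube of Table 2 is small enough to enumerate every cell, processes cells from base cells to the apex, and reuses the stored cost of each child instead of calling Aggregate recursively; these are simplifications for illustration, not the authors' implementation.

```python
from itertools import product

STAR = "*"
DOMAINS = [("IL", "CA", "NY"), ("night", "daytime"), ("rain", "snow", "fog")]
BASE_CELLS = {
    ("IL", "night", "rain"), ("IL", "night", "snow"), ("CA", "night", "snow"),
    ("CA", "daytime", "snow"), ("NY", "night", "snow"),
    ("NY", "daytime", "rain"), ("NY", "daytime", "fog"),
}


def is_empty(cell):
    """A cell is empty if no base cell matches its instantiated dimensions."""
    return not any(all(a == STAR or a == b for a, b in zip(cell, base))
                   for base in BASE_CELLS)


def t_cubing(eps):
    """Choose cells to precompute so that every query cost is at most eps."""
    cost, materialized = {}, set()
    cells = list(product(*[dom + (STAR,) for dom in DOMAINS]))
    # Topological order: cells with fewer aggregated dimensions come first.
    cells.sort(key=lambda c: c.count(STAR))
    for c in cells:
        if is_empty(c):
            cost[c] = 0
        elif STAR not in c:
            materialized.add(c)  # base cells must always be stored
            cost[c] = 1
        else:
            # Cheapest A_i-based aggregation over the aggregated dimensions.
            best = min(
                sum(cost[c[:i] + (a,) + c[i + 1:]] for a in DOMAINS[i])
                for i, value in enumerate(c) if value == STAR
            )
            if best > eps:
                materialized.add(c)  # too expensive online: precompute
                cost[c] = 1
            else:
                cost[c] = best       # cheap enough: leave for query time
    return materialized, cost


loose_cells, loose_cost = t_cubing(100)
tight_cells, tight_cost = t_cubing(1)
```

With a loose threshold only the base cells are stored and the apex cell costs seven accesses; with a threshold of 1 more cells are stored, and no query needs more than one access.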


6. EXPERIMENTAL STUDY 

ASRS (Aviation Safety Reporting System)² is a voluntary
system run by NASA that allows pilots and other airplane
crew members to confidentially report aviation-related safety
incidents in the interest of improving air safety. An online
system³ [22] has been built to test the text cube ideas on
the ASRS dataset. Several algorithms [13, 23, 7] are
implemented in the system.

In our experiments, both a case study (Section 6.1) and a
performance study (Section 6.2) are given. All algorithms
are implemented in C++ (Visual Studio 2005) with SQL
(Microsoft SQL Server 2008) and run on a PC with a 0.99 GHz
CPU and 1 GB of memory.

ASRS Dataset. 60,499 flight accident records covering the
past ten years are downloaded from the ASRS database.
Outliers are removed. Each record consists of a pilot report
(i.e., document) and 56 attributes, among which we use 10
categorical attributes as the dimensions in our text cube.
The 10 dimensions are ‘Date’, ‘State’, ‘Person’, ‘Weather’,
‘Light’, ‘Engine Make Model’, ‘Flight Phase’, ‘Problem Primary
Area’, ‘Event Anomaly Type’ and ‘Resolutory Action’.

2 http://asrs.arc.nasa.gov/

3 http://inextcube.cs.uiuc.edu/nasa/

Preprocessing. 39,272 words/phrases are extracted from 
pilot reports, and 100 topics are generated by Latent Dirich- 
let Allocation [2]. A text cube is built on the ASRS dataset, 
which contains 16.67 trillion cells, among which 1,677,587 
are non-empty cells. 

6.1 Case Study 

Fatigue is defined as ‘a non-pathologic state resulting in a
decreased ability to maintain function or workload due to
mental or physical stress.’ Fatigue is a threat to aviation
safety because it impairs alertness and performance. It is a
normal response to many conditions common to flight
operations, such as sleep loss, shift work, and long duty
cycles [4, 14, 9].

6.1.1 Analysis of Relevant Topics 

Given the keyword query (‘fatigue’, ‘tired’), we calculate
the relevance score for each topic by Equation 2. Topic
85 is the one with the highest relevance score, whose word
distribution is shown in Table 3.


day          0.0888     hour         0.0619     trip            0.0417
time         0.0416     duty         0.0346     rest            0.0301
night        0.0288     minute       0.0279     leg             0.0258
fatigue      0.0182     late         0.0156     schedule        0.0150
morning      0.0140     long         0.0138     day             0.0133
fly          0.0124     early        0.0115     tired           0.0111
sleep        0.0104     previous     0.0102     hotel           0.0092
crew         0.0088     period       0.0088     arrive          0.0087
home         0.0079     legal        0.0072     block           0.0062
total        0.0041     evening      0.0041     delay           0.0041
work         0.0040     leave        0.0038     break           0.0038
assignment   0.0038     overnight    0.0037     reserve         0.0037
desk         0.0036     sick         0.0036     layover         0.0029
body         0.0029     month        0.0028     reduce          0.0027
show         0.0026     afternoon    0.0026     sequence        0.0024
company      0.0023     pair         0.0022     depart          0.0022
international 0.0022    begin        0.0021     room            0.0021
factor       0.0020     week         0.0020     pick            0.0019
assign       0.0018     deadhead     0.0018     wait            0.0018
bed          0.0018     awake        0.0018     flight          0.0017

Table 3: The Word Distribution of Topic ‘Fatigue’ (words listed
row by row in decreasing order of probability).


Below is an interesting pilot report about ‘fatigue’, which
mentions several words in Table 3 such as ‘fatigue’, ‘sleep’,
‘hour’, ‘rest’, ‘depart’, ‘leg’, ‘break’, ‘duty’, ‘day’, etc.

Example 4. FATIGUING ASSIGNMENTS. AFTER I LNDG IN ZZZ I
WENT TO SLEEP AT XA00 ZZZ1 TIME. MY PRE ALL-NIGHTER NAP WAS
AT XN00 ZZZ1 TIME. MY POST ALL-NIGHTER REST WAS AT XD00 ZZZ1
TIME AND MY REST BEFORE AN XA00 LAX DEP WAS AT XD00 ZZZ1
TIME. THAT IS 4 DIFFERENT SLEEP TIMES IN LESS THAN 48 HRS.
UPON LNDG IN ZZZ1 I WAS EXPECTED TO DO 2 MORE LEGS WITH
2 HR BREAKS FOR A 12+ HR DUTY DAY ON DAY 5! THIS WOULD HAVE
BECOME UNSAFE AND I CALLED IN FATIGUED.

6.1.2 Analysis of Correlated Topics 

Topic 85, shown in Table 3, contains general terms related
to ‘fatigue’, from which we can infer that ‘long duty’ and
‘insufficient rest’ are two major causes of ‘fatigue’.
However, many other factors that lead to, or are related to,
‘fatigue’ are less obvious than those in topic 85. By mining
correlated topics, we can discover more detailed, complex,
and varied causes of ‘fatigue’.

Concretely, we enumerate all cells whose aggregated dimensions
are no more than 2; for each cell, we compute the correlation
score between the topic ‘fatigue’ and the other topics in
that cell; finally, we rank the topics (together with their
cells) by correlation score (see Table 4 and footnote 4).
Each row should be read as follows: the topic t (3rd column)
is correlated with the topic ‘fatigue’ in the cell c (2nd
column) with correlation score s (1st column). A sample pilot
report (4th column) is given as supporting evidence, along
with a human interpretation (5th column).
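
The enumerate, score, and rank procedure can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: a cell is read here as fixing at most two fixed-field dimensions (consistent with the cells listed in Table 4), the Pearson correlation of per-document topic proportions stands in for the paper's correlation score, and the document layout (`fields`, `theta`), the `min_docs` cutoff, and the toy data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def enumerate_cells(docs, dims, max_fixed=2):
    """Yield (cell, docs_in_cell); a cell is a tuple of (dimension, value)
    pairs, and the empty tuple plays the role of '*'."""
    for k in range(max_fixed + 1):
        for chosen in combinations(dims, k):
            groups = defaultdict(list)
            for doc in docs:
                key = tuple((d, doc["fields"][d]) for d in chosen)
                groups[key].append(doc)
            yield from groups.items()

def rank_correlated_topics(docs, dims, target, num_topics, min_docs=3):
    """Return (score, cell, topic) rows sorted by descending score."""
    rows = []
    for cell, members in enumerate_cells(docs, dims):
        if len(members) < min_docs:  # too few reports for a stable estimate
            continue
        target_vals = [d["theta"][target] for d in members]
        for t in range(num_topics):
            if t == target:
                continue
            other_vals = [d["theta"][t] for d in members]
            rows.append((pearson(target_vals, other_vals), cell, t))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows

# Hypothetical toy corpus: theta holds per-document proportions of 3 topics.
docs = [
    {"fields": {"weather": "fog", "phase": "cruise"}, "theta": [0.6, 0.3, 0.1]},
    {"fields": {"weather": "fog", "phase": "cruise"}, "theta": [0.4, 0.2, 0.4]},
    {"fields": {"weather": "fog", "phase": "landing"}, "theta": [0.2, 0.1, 0.7]},
    {"fields": {"weather": "clear", "phase": "cruise"}, "theta": [0.5, 0.1, 0.4]},
    {"fields": {"weather": "clear", "phase": "landing"}, "theta": [0.1, 0.6, 0.3]},
    {"fields": {"weather": "clear", "phase": "landing"}, "theta": [0.3, 0.2, 0.5]},
]
ranking = rank_correlated_topics(docs, ["weather", "phase"], target=0, num_topics=3)
top_score, top_cell, top_topic = ranking[0]
print(top_cell, top_topic, round(top_score, 3))
```

In this toy corpus the strongest correlation with topic 0 appears only inside the fog cell, not in the all-aggregating cell, which is the kind of cell-specific signal the ranking is meant to surface.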

Table 4 illustrates the effectiveness of our approach. For
example, in row 2, which has a relatively high correlation
score, the most correlated topic assigns high probability to
words such as ‘stress’. In the representative report, the
pilot first explains that he/she forgot something, and later
mentions that this may have been due to fatigue and to
insufficient rest between flights. The fact that this
correlation is high in the cell ‘[Flight Phase]: cruise level’
may indicate that pilots are most affected by fatigue during
that phase. In row 3, the correlated topic contains ‘lack’,
‘focus’, and also ‘fatigue’; it also agrees with common sense
that in ‘[Weather]: Fog’ pilots (like drivers) tire easily
and thus lose focus.


6.2 Performance Study 

6.2.1 Experiment 1: Storage Size 

Figure 1 reports the storage costs while varying (i) the thresh- 
old e in the T-CUBING algorithm, and (ii) the number of 
dimensions of the text cube. FULL is the cube with full ma- 
terialization, and CUBE20, CUBE60 and CUBE100 are the 
text cubes with e being 20, 60 and 100, respectively. The 
number of dimensions varies from 2 to 10. 

We can observe and/or verify three facts: (i) In principle, the
smaller e is, the more cells need to be pre-computed, thus
leading to a larger storage cost. As verified in Figure 1,
FULL always has the largest storage size, since FULL is
equivalent to a text cube with e being 1. In contrast,

4 In the first row, second column of Table 4, * denotes the
cell that aggregates all dimensions.




Score   Discovered Cell                    Correlated Topic            Sample Document                 Human Interpretation

1.000   *                                  day, hour, trip, time,      FLIGHT HAD PREVIOUSLY BEEN      Duty Cycle
                                           duty, rest, night, leg,     DELAYED AND WE HAD MINIMUM
                                           fatigue, min, late          REST PERIOD COMING UP, LESS
                                                                       THAN 9 HOURS.

0.6092  [Flight Phase]: cruise level;      year, good, time, month,    ALTHOUGH I AM COMPLETELY        Attitude
        [Resolutory Action]: equipment     experience, fly, past,      FAMILIAR WITH THE AIRSPACE; I
        problem dissipated                 stress                      COMPLETELY FORGOT ABOUT THAT
                                                                       SEGMENT OF THE CLASS B

0.5509  [Weather]: fog;                    awareness, failure,         AS WE FLEW FURTHER OUT OVER     Illusion
        [Resolutory Action]: issue new     attention, realize, focus,  THE WATER; THE CLOUDS SEEMED
        clearance                          fatigue, lack               TO BE SLIGHTLY LOWER IN PLACES

0.5312  [Resolutory Action]: took evasive  pos, radar, supervise,      I THINK HE CONFUSED 10:00 POS   Proficiency
        action; [Make Model]: airbus       trainee, CTLR, error,       AND 2:00 POS; AS THEY ARE BOTH
                                           alert, busy, sector         20 DEGREES OFF OF OUR NOSE.

0.5085  [Flight Phase]: landing;           time, high, workload,       I NEGLECTED TO RESELECT THE     Taskload
        [Event Anomaly]: landing without   unable, lack, delay,        OTHER SIDE OF THE RADIO TO
        clearance                          difficulty, additional      TALK TO TWR AS I WAS BUSY
                                                                       WITH THE CHKLIST.


Table 4: Correlated Topics with the Topic ‘Fatigue’ 



Figure 1: Storage Cost 


CUBE100 is always the smallest one. (ii) The space compression
ratio from FULL to CUBE20 is larger than that from CUBE20 to
CUBE60, which in turn is larger than that from CUBE60 to
CUBE100. Although the storage size decreases monotonically
as e increases, the reduction becomes negligible once e is
sufficiently large. How to select an appropriate e to balance
the time and space costs remains an interesting question for
future work. (iii) The compression ratio increases with the
number of dimensions, because a text cube with more
dimensions has a smaller average number of documents per
cell, which results in less pre-computation.


6.2.2 Experiment 2: Query Processing Time 

We report the query processing time in text cubes with
different thresholds (i.e., CUBE20, CUBE60 and CUBE100).
BASIC is the baseline query processor, which computes
measures by retrieving documents from the raw database.



[Figure 2 plot not recoverable; x-axis: Cell Size; legible legend entries: BASIC, CUBE100]

Figure 2: Query Time by Varying Cell Size

In Figure 2, the average query processing time is shown
as a function of cell size (the size of a cell is the number
of documents aggregated in the cell). As expected, the
processing time of BASIC increases approximately linearly,
while the other curves fluctuate as the cell size increases.
The reason lies in the materialization procedure of T-CUBING:
at the very beginning, all base cells are precomputed; as
the cell size increases, more and more cells need to be
accessed to answer queries; when CostQ reaches the threshold
e, T-CUBING begins to precompute cells again.
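
The interplay between the threshold e, storage, and query cost can be mimicked with a toy model. The sketch below is not the T-CUBING algorithm itself; it assumes a dense cube in which every base cell is non-empty, measures the cost of an unmaterialized cell as the number of materialized cells that must be read to answer it (rolling up along the cheapest aggregated dimension), and materializes a cell once that cost reaches the threshold. The function name `plan_materialization` and the dimension values are hypothetical.

```python
from itertools import product

STAR = "*"  # marks an aggregated dimension

def plan_materialization(domains, threshold):
    """Toy threshold-based materialization plan (assumed model).

    domains: one list of values per dimension.
    Returns (materialized_cells, cost), where cost[cell] is the number of
    materialized cells read to answer a query on that cell.
    """
    cells = list(product(*[list(values) + [STAR] for values in domains]))
    cells.sort(key=lambda cell: cell.count(STAR))  # base cells come first
    cost = {}
    materialized = set()
    for cell in cells:
        stars = [i for i, value in enumerate(cell) if value == STAR]
        if not stars:
            # Base cells are always precomputed.
            cost[cell] = 1
            materialized.add(cell)
            continue
        # Cheapest way to answer: expand one aggregated dimension and
        # combine the answers of the resulting child cells.
        best = min(
            sum(cost[cell[:i] + (value,) + cell[i + 1:]] for value in domains[i])
            for i in stars
        )
        if best >= threshold:
            materialized.add(cell)  # too expensive on the fly: precompute
            cost[cell] = 1
        else:
            cost[cell] = best
    return materialized, cost

# Hypothetical fixed-field dimensions with three values each.
domains = [
    ["fog", "rain", "clear"],
    ["climb", "cruise", "landing"],
    ["airbus", "boeing", "other"],
]
sizes = {e: len(plan_materialization(domains, e)[0]) for e in (1, 4, 10)}
print(sizes)
```

In this model a threshold of 1 materializes every cell, matching the remark that FULL corresponds to e being 1, and larger thresholds store fewer cells. The per-cell cost climbs with cell size until it reaches the threshold and then drops back to 1, which is one way to picture the fluctuation of the CUBE curves.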



Figure 3: Query Time by Varying # Dimensions

In Figure 3, the average query processing time is plotted as
a function of the number of aggregated dimensions of the
queried cell. As expected, the query time of BASIC increases
sharply, while CUBE20, CUBE60 and CUBE100 rise
comparatively smoothly with the help of the text cube; among
them, CUBE20 is the fastest. The same fluctuation as in
Figure 2 is visible.


7. CONCLUSIONS 

We demonstrated a novel method for generating correlated
topic models from a large corpus of text reports. The models
can be analyzed in the subspaces (cells) defined by the
intersection of numerous fixed fields and discovered
correlated topics. The research is motivated by the need
to develop technologies that help uncover aviation safety
incidents before they happen, based on large repositories of
aviation safety narratives. These narratives are also
annotated with numerous fixed fields, making them an
excellent application domain for this research. The large
number of potential cells in the resulting text cube demands
that the computational complexity be sufficiently bounded.
We applied this system to the analysis of crew fatigue and
showed potential factors that may be related to fatigue
issues. Although a full analysis of crew fatigue and its
contributing and correlated factors is beyond the scope of
this paper, the technologies described can be used in future
studies to understand these issues.